Statistical Parsing by Machine Learning from a Classical Arabic Treebank
نویسنده
چکیده
Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (ةاغعإ). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year. ِيحِ ه رم ٱ نِػٰ َ حْْ ه رم ٱ ِ ه للَّ ٱ مِسْبِ ُيكِحَْمإ يُلِعَْمإ تَهٱَ مَه ه ِ إ اَنتَمْه لَع امَ ه لَ ِ إ اَنَم َ لْْعِ لََ مََهاحَبْ ُ س „Glory be to thee! We have no knowledge except what you have taught us. Indeed it is you who is the all-knowing, the all-wise.‟ A prayer of the angels –The Quran, verse (2:32)
منابع مشابه
تصحیح خودکار خطا در درخت بانک نحوی با استفاده از یادگیری ماشینی انتقال محور
The Treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebank can be developded in different ways that could be, generally, categorized in manually and statistical approaches. While the resulted Treebank in each of these methods has the annotation error,...
متن کاملSupervised learning model for parsing Arabic language
Parsing the Arabic language is a difficult task given the specificities of this language and given the scarcity of digital resources (grammars and annotated corpora). In this paper, we suggest a method for Arabic parsing based on supervised machine learning. We used the SVMs algorithm to select the syntactic labels of the sentence. Furthermore, we evaluated our parser following the cross valida...
متن کاملARSYPAR: A tool for parsing the Arabic language based on supervised learning
In this paper, we present a tool for parsing the Arabic language based on supervised machine learning. The used algorithm for the learning phase is the support vector machine. We also used the Penn Arabic Treebank as a learning corpus. Furthermore, we evaluated our parser following the cross validation method. The obtained results are very encouraging. We give at the end our vision to ameliorat...
متن کاملStatistical Dependency Parsing of Four Treebanks
Multilingual dependency parsing is gaining popularity in recent years for several reasons. Dependency structures are more adequate for languages with freer word order than the traditional constituency notion. There is a growing availability of dependency treebanks for new languages. Broad coverage statistical dependency parsers are available and easily portable to new languages. Dependency pars...
متن کاملWhy is German Dependency Parsing More Reliable than Constituent Parsing?
In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, but also for Arabic, Chinese, or Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotations used [ , ]. Another direction ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1510.07193 شماره
صفحات -
تاریخ انتشار 2013